First Bytes

Substitution Cipher Lab

Introduction: In this lab activity you will use MatLab and write functions to decrypt a message that is encoded with a Substitution cipher.

This lab is available at www.cs.utexas.edu/users/scottm/FirstBytes/substitutionLab.htm

A Substituion Cipher  is a more complicated and thus stronger cipher than a Caesar cipher. In a substitution cipher the clear letter is encoded to another letter chosen at random. To create a cipher pick the alphabet you wish to encode. Take another copy of that alphabet and randomize or shuffle them. Assign each letter in the random alphabet to a letter in the clear alphabet.

Again, assume we are only encoding letters. The alphabet is the 26 capital letters from A to Z. One possible random combination is:

BMFXGUDIQTNHWOEPAKZSCLJVYR

If we are only encoding the alphabet then there are 26! possible keys. (403,291,461,126,605,635,584,000,000)

Clear Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Encode Letter B M F X G U D I Q T N H W O E P A K Z S C L J V Y R

So to encrypt a message we just go through characters in the message and substitute the Encode letter. For all the labs this week will be making some simplifying assumptions. 

So to encrypt the message First Bytes! using the cipher above we go through character by character.

F is converted to U
i is converted to I then Q
r is converted to R then K
s is converted to S then Z
t is coverted to T then S
the space stays the same
B is converted to M
y is converted to Y then Y
t is converted to T then S
e is converted to E then G
s is converted to S then Z
! stays the same

So the encrypted message is UQKZS MYSGZ!

To decrypt the message the receiver would need the above key. Any letter is decoded by looking up its clear letter equivalent.

Use the code above to decode this message:

QZBGHHG QZ  SIKGG.


Substitution ciphers are harder to break than Caesar ciphers. Why? Due to the number of possible keys. For a 26 letter alphabet there are 403,291,461,126,605,635,584,000,000 possible keys. No matter how fast a computer Professor Burger makes, brute force is not an option here. Consider the following example. What if we can test 2,000,000,000 keys a second? How long to test all possible keys?  About 6.4 billion years. Close to the estimated age of the universe. 

But you have a friend and that is the frequency certain letters are used. In all known languages that use alphabets some characters or letters are used more frequently than others. How do you take advantage of that? 

  1. Count up the number of times each encoded character appears in the encoded message.

  2. Next, sort those characters in order of occurrence

  3. Next you need to know the language of the message and then the relative frequency that characters in that language appear in normal text. 

  4. Match up this list of normal letters with the encoded letters. This is your initial key. Use it to decode the message.

  5. Chances are good that the initial key will not be completely correct. Make changes to the key as you see fit based on how the message decodes. Eventually you will have the correct key and will have successfully decoded the message.

When applying frequency analysis to decrypt a message encoded with a substitution cipher you chances of successes increase as the amount of encoded message goes up. Why is this?

Example:

Here is the encoded message:

SIG NGY GHGWGOSZ EU FEWPCSGK ZFQGOFG BKG HEDQFBH SIQONQOD BOX SIG GHGDBOSHY XGZQDOGX ZEHCSQEOZ SE B JQXG LBKQGSY EU PKEMHGWZ. ZCFFGZZUCH FEWPCSGK ZFQGOSQZSZ JEKN BZ SGBWZ SE UQOX QOOELBSQLG ZEHCSQEOZ SE ZEFQGSY'Z PKEMHGWZ. UQKZS MYSGZ QZ BQWGX BS YECOD JEWGO JIE JBOS B ZOGBN PGBN QOSE ZFQGOFG SIBS IBZ FIBODGX ECK JEKHX. QZ QS UEK YEC? SIQON BMECS SIG UEHHEJQOD ACGZSQEOZ: XE YEC GOTEY SBNQOD B HBKDG PKEMHGW BOX SIQONQOD SIKECDI IEJ SE MKGBN QS CP BOX WBNG QS ZEHLBMHG? XE YEC HQNG XEQOD SIG GVPGKQWGOSZ QO FIGWQZSKY BOX MQEHEDY? XE YEC HQNG SE EKDBOQRG SIQODZ, G.D. SIG ZEFFGK SGBW XQOOGK, SIG PEZS-PKEW PBKSY? BKG YEC FKGBSQLG? XE YEC GOTEY HGBKOQOD OGJ JBYZ SE ZEHLG EHX PKEMHGWZ? XE GOTEY WBSI, PCRRHGZ, BOX CZQOD HEDQF SE UQDCKG SIQODZ ECS? QU YEC’LG IBX B FIBOFG SE SBNG FEWPCSGK ZFQGOFG EK PKEDKBWWQOD FHBZZ, XQX YEC HQNG QS? JGKG YEC DEEX BS QS? QU YEC FBO BOZJGK YGZ SE ZEWG EU SIGZG, UQKZS MYSGZ WQDIS MG TCZS JIBS YEC BKG HEENQOD UEK SIQZ ZCWWGK. JG GOFECKBDG YEC SE BPPHY SE UQKZS MYSGZ QU: YEC BKG B DQKH JIE QZ TCZS UQOQZIQOD GQSIGK YECK ZEPIEWEKG EK TCOQEK YGBK QO IQDI ZFIEEH, YEC IBLG XEOG JGHH QO WBSI BOX ZFQGOFG FHBZZGZ, YEC IBLG B DPB EU SIKGG PEQOS RGKE (EK GQDISY ECS EU EOG ICOXKGX), BOX YEC BOZJGKGX YGZ SE ZGLGKBH EU SIG ACGZSQEOZ HQZSGX BMELG. OE BXLBOFGX WBSI EK PKQEK PKEDKBWWQOD GVPGKQGOFG QZ KGACQKGX. JIBS QZ KGACQKGX QZ FCKQEZQSY BMECS SIG WEZS PEJGKUCH BKSQUBFS SIG ICWBO KBFG IBZ GLGK MCQHS. HGBKO WEKG MY BPPHYQOD SE UQKZS MYSGZ, SEXBY!

1. Counting the occurrences of each letter yields:

Encoded Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Number of Occurences 4 78 49 35 119 34 14 19 22 26 31 34 35 36 41 42 46 49 71 73 77 78 89 95 119 132

2. Sort the occurrences keeping the letter paired with the number of times it occurred.,

Encoded Letter  V A R T L N M J U P W F D X H Y I C K O Z B Q S E G
Number of Occurences 4 78 49 35 119 34 132 41 46 19 71 13 17 15 73 26 89 4 95 6 22 2 31 36 42 77

3. Assume the message is in English. The list of characters in ascending order by relative frequency in English is ZQXJKVBPYGFWMUCLDRHSNIOATE.

4. Match up the list of normal letters with the encoded letters to create the initial key:

Encoded Letter  V A R T L N M J U P W F D X H Y I C K O Z B Q S E G
Normal Letter Z Q X J K V B P Y G F W M U C L D E H S N I O A T E

The normal letter becomes the decode letter. To decode the text with this key it may be useful to sort the initial key by the Encoded Letter

Encoded Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Decode Letter Q I R M T W E C D P H K B V S G O X A J Y Z F U L N

5. Now decode the message using the initial key.

ADE VEL ECEFESAN TY WTFGRAEH NWOESWE IHE CTMOWIC ADOSVOSM ISU ADE ECEMISACL UENOMSEU NTCRAOTSN AT I POUE KIHOEAL TY GHTBCEFN. NRWWENNYRC WTFGRAEH NWOESAONAN PTHV IN AEIFN AT YOSU OSSTKIAOKE NTCRAOTSN AT NTWOEAL'N GHTBCEFN. YOHNA BLAEN ON IOFEU IA LTRSM PTFES PDT PISA I NSEIV GEIV OSAT NWOESWE ADIA DIN WDISMEU TRH PTHCU. ON OA YTH LTR? ADOSV IBTRA ADE YTCCTPOSM QRENAOTSN: UT LTR ESJTL AIVOSM I CIHME GHTBCEF ISU ADOSVOSM ADHTRMD DTP AT BHEIV OA RG ISU FIVE OA NTCKIBCE? UT LTR COVE UTOSM ADE EZGEHOFESAN OS WDEFONAHL ISU BOTCTML? UT LTR COVE AT THMISOXE ADOSMN, E.M. ADE NTWWEH AEIF UOSSEH, ADE GTNA-GHTF GIHAL? IHE LTR WHEIAOKE? UT LTR ESJTL CEIHSOSM SEP PILN AT NTCKE TCU GHTBCEFN? UT ESJTL FIAD, GRXXCEN, ISU RNOSM CTMOW AT YOMRHE ADOSMN TRA? OY LTR’KE DIU I WDISWE AT AIVE WTFGRAEH NWOESWE TH GHTMHIFFOSM WCINN, UOU LTR COVE OA? PEHE LTR MTTU IA OA? OY LTR WIS ISNPEH LEN AT NTFE TY ADENE, YOHNA BLAEN FOMDA BE JRNA PDIA LTR IHE CTTVOSM YTH ADON NRFFEH. PE ESWTRHIME LTR AT IGGCL AT YOHNA BLAEN OY: LTR IHE I MOHC PDT ON JRNA YOSONDOSM EOADEH LTRH NTGDTFTHE TH JRSOTH LEIH OS DOMD NWDTTC, LTR DIKE UTSE PECC OS FIAD ISU NWOESWE WCINNEN, LTR DIKE I MGI TY ADHEE GTOSA XEHT (TH EOMDAL TRA TY TSE DRSUHEU), ISU LTR ISNPEHEU LEN AT NEKEHIC TY ADE QRENAOTSN CONAEU IBTKE. ST IUKISWEU FIAD TH GHOTH GHTMHIFFOSM EZGEHOESWE ON HEQROHEU. PDIA ON HEQROHEU ON WRHOTNOAL IBTRA ADE FTNA GTPEHYRC IHAOYIWA ADE DRFIS HIWE DIN EKEH BROCA. CEIHS FTHE BL IGGCLOSM AT YOHNA BLAEN, ATUIL!

Not very promising, but let's look at the text and try some changes to the key. The first word right now is ADE. What three letter word starts more sentences than any other? THE. So, let's try the following change. What ever is decoding to A should decode to T and whatever was decoding to T should decode to A. S was decoding to A and E was decoding to T. Swap those two and the key becomes:

Encoded Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Decode Letter Q I R M A W E C D P H K B V S G O X T J Y Z F U L N

Decoding the message with the new key yields:

TDE VEL ECEFESTN AY WAFGRTEH NWOESWE IHE CAMOWIC TDOSVOSM ISU TDE ECEMISTCL UENOMSEU NACRTOASN TA I POUE KIHOETL AY GHABCEFN. NRWWENNYRC WAFGRTEH NWOESTONTN PAHV IN TEIFN TA YOSU OSSAKITOKE NACRTOASN TA NAWOETL'N GHABCEFN. YOHNT BLTEN ON IOFEU IT LARSM PAFES PDA PIST I NSEIV GEIV OSTA NWOESWE TDIT DIN WDISMEU ARH PAHCU. ON OT YAH LAR? TDOSV IBART TDE YACCAPOSM QRENTOASN: UA LAR ESJAL TIVOSM I CIHME GHABCEF ISU TDOSVOSM TDHARMD DAP TA BHEIV OT RG ISU FIVE OT NACKIBCE? UA LAR COVE UAOSM TDE EZGEHOFESTN OS WDEFONTHL ISU BOACAML? UA LAR COVE TA AHMISOXE TDOSMN, E.M. TDE NAWWEH TEIF UOSSEH, TDE GANT-GHAF GIHTL? IHE LAR WHEITOKE? UA LAR ESJAL CEIHSOSM SEP PILN TA NACKE ACU GHABCEFN? UA ESJAL FITD, GRXXCEN, ISU RNOSM CAMOW TA YOMRHE TDOSMN ART? OY LAR’KE DIU I WDISWE TA TIVE WAFGRTEH NWOESWE AH GHAMHIFFOSM WCINN, UOU LAR COVE OT? PEHE LAR MAAU IT OT? OY LAR WIS ISNPEH LEN TA NAFE AY TDENE, YOHNT BLTEN FOMDT BE JRNT PDIT LAR IHE CAAVOSM YAH TDON NRFFEH. PE ESWARHIME LAR TA IGGCL TA YOHNT BLTEN OY: LAR IHE I MOHC PDA ON JRNT YOSONDOSM EOTDEH LARH NAGDAFAHE AH JRSOAH LEIH OS DOMD NWDAAC, LAR DIKE UASE PECC OS FITD ISU NWOESWE WCINNEN, LAR DIKE I MGI AY TDHEE GAOST XEHA (AH EOMDTL ART AY ASE DRSUHEU), ISU LAR ISNPEHEU LEN TA NEKEHIC AY TDE QRENTOASN CONTEU IBAKE. SA IUKISWEU FITD AH GHOAH GHAMHIFFOSM EZGEHOESWE ON HEQROHEU. PDIT ON HEQROHEU ON WRHOANOTL IBART TDE FANT GAPEHYRC IHTOYIWT TDE DRFIS HIWE DIN EKEH BROCT. CEIHS FAHE BL IGGCLOSM TA YOHNT BLTEN, TAUIL!

See all the TDEs in the text. That is probably THE so change whatever was decoding to D to H. New key is:

Encoded Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Decode Letter Q I R M A W E C H P D K B V S G O X T J Y Z F U L N

New decoded message:

THE VEL ECEFESTN AY WAFGRTED NWOESWE IDE CAMOWIC THOSVOSM ISU THE ECEMISTCL UENOMSEU NACRTOASN TA I POUE KIDOETL AY GDABCEFN. NRWWENNYRC WAFGRTED NWOESTONTN PADV IN TEIFN TA YOSU OSSAKITOKE NACRTOASN TA NAWOETL'N GDABCEFN. YODNT BLTEN ON IOFEU IT LARSM PAFES PHA PIST I NSEIV GEIV OSTA NWOESWE THIT HIN WHISMEU ARD PADCU. ON OT YAD LAR? THOSV IBART THE YACCAPOSM QRENTOASN: UA LAR ESJAL TIVOSM I CIDME GDABCEF ISU THOSVOSM THDARMH HAP TA BDEIV OT RG ISU FIVE OT NACKIBCE? UA LAR COVE UAOSM THE EZGEDOFESTN OS WHEFONTDL ISU BOACAML? UA LAR COVE TA ADMISOXE THOSMN, E.M. THE NAWWED TEIF UOSSED, THE GANT-GDAF GIDTL? IDE LAR WDEITOKE? UA LAR ESJAL CEIDSOSM SEP PILN TA NACKE ACU GDABCEFN? UA ESJAL FITH, GRXXCEN, ISU RNOSM CAMOW TA YOMRDE THOSMN ART? OY LAR’KE HIU I WHISWE TA TIVE WAFGRTED NWOESWE AD GDAMDIFFOSM WCINN, UOU LAR COVE OT? PEDE LAR MAAU IT OT? OY LAR WIS ISNPED LEN TA NAFE AY THENE, YODNT BLTEN FOMHT BE JRNT PHIT LAR IDE CAAVOSM YAD THON NRFFED. PE ESWARDIME LAR TA IGGCL TA YODNT BLTEN OY: LAR IDE I MODC PHA ON JRNT YOSONHOSM EOTHED LARD NAGHAFADE AD JRSOAD LEID OS HOMH NWHAAC, LAR HIKE UASE PECC OS FITH ISU NWOESWE WCINNEN, LAR HIKE I MGI AY THDEE GAOST XEDA (AD EOMHTL ART AY ASE HRSUDEU), ISU LAR ISNPEDEU LEN TA NEKEDIC AY THE QRENTOASN CONTEU IBAKE. SA IUKISWEU FITH AD GDOAD GDAMDIFFOSM EZGEDOESWE ON DEQRODEU. PHIT ON DEQRODEU ON WRDOANOTL IBART THE FANT GAPEDYRC IDTOYIWT THE HRFIS DIWE HIN EKED BROCT. CEIDS FADE BL IGGCLOSM TA YODNT BLTEN, TAUIL!

Looks good. What changes next? We think the T, H, and E are all right. See all the TAs? Maybe A should be O.

Encoded Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Decode Letter Q I R M O W E C H P D K B V S G A X T J Y Z F U L N

THE VEL ECEFESTN OY WOFGRTED NWAESWE IDE COMAWIC THASVASM ISU THE ECEMISTCL UENAMSEU NOCRTAOSN TO I PAUE KIDAETL OY GDOBCEFN. NRWWENNYRC WOFGRTED NWAESTANTN PODV IN TEIFN TO YASU ASSOKITAKE NOCRTAOSN TO NOWAETL'N GDOBCEFN. YADNT BLTEN AN IAFEU IT LORSM POFES PHO PIST I NSEIV GEIV ASTO NWAESWE THIT HIN WHISMEU ORD PODCU. AN AT YOD LOR? THASV IBORT THE YOCCOPASM QRENTAOSN: UO LOR ESJOL TIVASM I CIDME GDOBCEF ISU THASVASM THDORMH HOP TO BDEIV AT RG ISU FIVE AT NOCKIBCE? UO LOR CAVE UOASM THE EZGEDAFESTN AS WHEFANTDL ISU BAOCOML? UO LOR CAVE TO ODMISAXE THASMN, E.M. THE NOWWED TEIF UASSED, THE GONT-GDOF GIDTL? IDE LOR WDEITAKE? UO LOR ESJOL CEIDSASM SEP PILN TO NOCKE OCU GDOBCEFN? UO ESJOL FITH, GRXXCEN, ISU RNASM COMAW TO YAMRDE THASMN ORT? AY LOR’KE HIU I WHISWE TO TIVE WOFGRTED NWAESWE OD GDOMDIFFASM WCINN, UAU LOR CAVE AT? PEDE LOR MOOU IT AT? AY LOR WIS ISNPED LEN TO NOFE OY THENE, YADNT BLTEN FAMHT BE JRNT PHIT LOR IDE COOVASM YOD THAN NRFFED. PE ESWORDIME LOR TO IGGCL TO YADNT BLTEN AY: LOR IDE I MADC PHO AN JRNT YASANHASM EATHED LORD NOGHOFODE OD JRSAOD LEID AS HAMH NWHOOC, LOR HIKE UOSE PECC AS FITH ISU NWAESWE WCINNEN, LOR HIKE I MGI OY THDEE GOAST XEDO (OD EAMHTL ORT OY OSE HRSUDEU), ISU LOR ISNPEDEU LEN TO NEKEDIC OY THE QRENTAOSN CANTEU IBOKE. SO IUKISWEU FITH OD GDAOD GDOMDIFFASM EZGEDAESWE AN DEQRADEU. PHIT AN DEQRADEU AN WRDAONATL IBORT THE FONT GOPEDYRC IDTAYIWT THE HRFIS DIWE HIN EKED BRACT. CEIDS FODE BL IGGCLASM TO YADNT BLTEN, TOUIL!

What next? About 10 more changes would yield the final key.

Here are some functions you may find useful in this lab

Function Name Description
loadtext(filename) This is a nonstandard MatLab function that you must load into your work directory to use.

loads all data from the file with name filename. 
Filename is a string. 
Returns a string

Example:
message = loadtext('caesarCoded.txt')

length(X) determines how many elements are in each column of a matrix
X is any variable
returns an integer

Example:
len = length(message)

blanks(num) creates a string consisting of num spaces
num is an integer
returns a string with num spaces

Example:
result = blanks(len)
matvar' Transposes a matrix variable so each column becomes a row. For example given

matvar = 1 2 3
         4 5 6

matvar = matvar'

matvar = 1 4
         2 5
         3 6
sortrows(matvar) sorts the rows of a matrix in ascending order based on the first column. For example given

matvar = 2 5
         3 6
         1 4
         4 8

matvar = sortrows(matvar)

matvar = 1 4
         2 5
         3 6
         4 8

Approach: The best way to solve the problem is in small chunks, following the steps outlined above. To review:

1. Create a function that accepts a single String or message as input. The function will need an array of 26 cells to store the result. Each cell should start at 0. Cell 1 will hold the number of 'A's in the message, cell 2 will hold the number of 'B's in the message and so forth. The function should loop through the message and examine every character. If the character is a letter then add one to the cell in the result that corresponds to that letter.

2. Take the array of frequencies from step 1 and create the initial key. This is the most complicated step.

3. To create the initial key create a matrix with 2 rows and 26 columns. The first row is the frequencies, and the second is the ASCII codes for the capital letters they correspond to. Assuming the array freqs holds the results of the frequencies. One way of doing this is

key = [freqs;65:90]

4. Transpose this matrix with the transpose command

key = key'

5. Use the sortrows function to sort the key

6. Transpose the matrix again, back to its original configuration. 

7. The second row of the matrix now corresponds to the letters in the message in increasing order of occurrence. The frequencies are no longer needed.

8. Build a new matrix with the letters from the message in the first row and the list of letters sorted in ascedning order by frequency in standard English as the second row.

9. Transpose this matrix, sort it with sortrows, and transpose it again. You will now have your initial key. You will probably want one function to count the frequencies of the letters and another that uses that function to create this initial key.

10. Decode the message with your initial key. For every letter in the original message substitute the letter from the initial key.

11. Complete a method that allows changes to the key. you should specify the what the old decode letter is, what the new decode letter should be, and the key itself. The function will return the new key, by swapping the two values in the old key.

 

File Description Link - Download all files to your work folder in MatLab
M file to load data from text files into a String variable. loadtext.m
A portion of a story encoded using a substitution cipher subMessage.txt
A list of letters sorted in ascending order by frequency in standard English text freqLetters.txt